An Optimal Feature Set for Stylometry-based Style Change detection at Document and Sentence Level
Abstract
Writing style change detection models focus on determining the number of authors of documents with or without known authors. Determining the exact number of authors contributing to a document, particularly when the authors contribute short texts in the form of a sentence, is still challenging because of the lack of standardized feature sets able to discriminate between the works of authors. Therefore, the task of identifying the best feature set for all style change detection tasks is still considered important. This paper sought to determine the best feature set for the following tasks: separating documents with several style changes (multi-authorship) from documents without any style changes (single-authorship), and determining the location of the style change in the case of multi-authorship. We performed exploratory research on existing stylometric features to determine the best document level and sentence level features. Document level features were extracted and used to separate single authored and multi-authored documents, while sentence level features were used to answer the question of the location of the change. To answer this question, we trained a random forest classifier to rank the document level and sentence level features separately, and applied an ablation test on the top 15 features using the k-means clustering algorithm to confirm the effect of these features on model performance. The study found that the best document level feature set was provided by an ensemble of features including the number of sentence repetitions (num_sentence_repetitions) as the most determinant feature, 5-grams, 4-grams, Special_character, sentence_begin_lower, sentence_begin_upper, diversity, automated_readability_index, parenthesis_count, first_word_uppercase, lensear_write_formula, dale_chall_readability, difficult_words, and type_token_ratio. These features were ranked in experiment one. On the other hand, the top fifteen sentence level features based on ranks included dale_chall_readability grade, check_available_vowel, flesch_kincaid grade, colon_count, verbs, bigrams, alphabets, personal pronouns, coordinating conjunctions, interjections, modals, type_token ratio, and punctuations_count. Consequently, optimal results were associated with features including check_available_vowels, punctuations_counts, parenthesis count, conjunctions, and colon count.
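To make a few of the feature names above concrete, the following is a minimal sketch of how some document level features might be computed. The abstract does not define the features, so every definition here (the naive sentence splitter, the tokenization rule, and the way repetitions are counted) is an assumption for illustration and not the authors' implementation. The function names `split_sentences` and `stylometric_features` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def stylometric_features(text):
    """Compute a handful of features named in the abstract.

    All definitions are illustrative assumptions, not the paper's own.
    """
    sentences = split_sentences(text)
    # Tokens: lowercase alphabetic words (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Sentences compared case-insensitively when counting repetitions.
    sentence_counts = Counter(s.lower() for s in sentences)
    return {
        # Distinct tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "colon_count": text.count(":"),
        # Opening and closing parentheses are both counted.
        "parenthesis_count": text.count("(") + text.count(")"),
        "sentence_begin_upper": sum(1 for s in sentences if s[0].isupper()),
        "sentence_begin_lower": sum(1 for s in sentences if s[0].islower()),
        # Each occurrence beyond the first counts as one repetition.
        "num_sentence_repetitions": sum(
            c - 1 for c in sentence_counts.values() if c > 1
        ),
    }


sample = (
    "Style matters. style matters. Style matters. "
    "Note: this (short) text repeats."
)
features = stylometric_features(sample)
for name, value in features.items():
    print(f"{name}: {value}")
```

A feature dictionary of this kind, computed per document or per sentence, is the sort of input that a ranking step such as the random forest described above would consume.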
Similar resources
Stylometry-based Fraud and Plagiarism Detection for Learning at Scale
Fraud detection in free and natural text submissions is a major challenge for educators in general. It is even more challenging to detect plagiarism at scale and in online classes such as Massive Open Online Courses. In this paper, we introduce a novel method that analyses the writing style of an author (stylometry) to identify plagiarism. We will show that our system scales to thousands of sub...
Document-to-Sentence Level Technique for Novelty Detection
Novelty identification is used to distinguish novel data from an incoming stream of documents. In this study, we propose a novel methodology for document-level novelty identification that uses a document-to-sentence-level strategy. This work first splits a document into sentences, decides the novelty of every sentence, then computes the document-level novelty score in view of an al...
A Corpus-Independent Feature Set for Style-Based Text Categorization
We suggest a corpus-independent feature set appropriate for style-based text categorization problems. To achieve this, we introduce a new measure on linguistic features, called stability, which captures the extent to which a language element, such as a word or syntactic construct, is replaceable by semantically equivalent elements. This measure may be perceived as quantifying the degree of avai...
Using Contextual Information for Unsupervised Change Detection Using Multitemporal SAR Images Based on Clustering and Level Set Methods
In this research, a framework is presented for unsupervised change detection using multitemporal SAR images, based on the integration of clustering and level set methods. Spatial correlation between pixels was considered by using contextual information. The proposed method also integrates Gustafson-Kessel clustering (GKC) and level set methods for change detection. Using clust...
Feature Set Reduction for Document Classification Problems
With a growing amount of electronic documents available, there is a need to classify documents automatically. In growing text classification applications, important-term selection is a critical task for the classifier performance. Although many different techniques and heuristics have been developed, this paper shows that many of them are just a sub-set of more advanced methods originating in t...
Journal
Journal title: International Journal of Scientific Research in Computer Science, Engineering and Information Technology
Year: 2022
ISSN: 2456-3307
DOI: https://doi.org/10.32628/cseit228617